{ "cells": [ { "cell_type": "markdown", "source": [ "# Pandas Dataframe Exercises\n", "## Try me\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ffraile/computer_science_tutorials/blob/main/source/Data%20Manipulation/exercises/Pandas%20Dataframes.ipynb)[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ffraile/computer_science_tutorials/main?labpath=source%2FData%20Manipulation%2Fexercises%2FPandas%20Dataframes.ipynb)\n", "\n", "In the first exercises, we are going to use the open dataset from the National Institute of Diabetes and Digestive and Kidney Diseases which is available in Kaggle. The dataset contains information about patients with diabetes.\n", "You can find it in this URL:\n", "\n", "https://www.kaggle.com/uciml/pima-indians-diabetes-database\n", "\n", "We have downloaded the dataset and we have uploaded it to the repository of the course. You can find it in the following URL:\n", "\n", "https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv'\n", "\n", "The dataset contains the following columns:\n", "\n", "* Pregnancies: Number of times pregnant\n", "* Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test\n", "* BloodPressure: Diastolic blood pressure (mm Hg)\n", "* SkinThickness: Triceps skin fold thickness (mm)\n", "* Insulin: 2-Hour serum insulin (mu U/ml)\n", "* BMI: Body mass index (weight in kg/(height in m)^2)\n", "* DiabetesPedigreeFunction: Diabetes pedigree function\n", "* Age: Age (years)\n", "* Outcome: Class variable (0 or 1)\n", "* 268 of 768 are 1, the others are 0\n", "* Class Distribution: (class value 1 is interpreted as \"tested positive for diabetes\")\n", "\n", "The following code loads the dataset into a Pandas dataframe:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "diabetes_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/diabetes.csv')\n", "diabetes_pd" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" }, "ExecuteTime": { "end_time": "2025-12-25T18:44:29.062132Z", "start_time": "2025-12-25T18:44:28.488505Z" } }, "outputs": [ { "data": { "text/plain": [ " Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \\\n", "0 6 148 72 35 0 33.6 \n", "1 1 85 66 29 0 26.6 \n", "2 8 183 64 0 0 23.3 \n", "3 1 89 66 23 94 28.1 \n", "4 0 137 40 35 168 43.1 \n", ".. ... ... ... ... ... ... \n", "763 10 101 76 48 180 32.9 \n", "764 2 122 70 27 0 36.8 \n", "765 5 121 72 23 112 26.2 \n", "766 1 126 60 0 0 30.1 \n", "767 1 93 70 31 0 30.4 \n", "\n", " DiabetesPedigreeFunction Age Outcome \n", "0 0.627 50 1 \n", "1 0.351 31 0 \n", "2 0.672 32 1 \n", "3 0.167 21 0 \n", "4 2.288 33 1 \n", ".. ... ... ... \n", "763 0.171 63 0 \n", "764 0.340 27 0 \n", "765 0.245 30 0 \n", "766 0.349 47 1 \n", "767 0.315 23 0 \n", "\n", "[768 rows x 9 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PregnanciesGlucoseBloodPressureSkinThicknessInsulinBMIDiabetesPedigreeFunctionAgeOutcome
061487235033.60.627501
11856629026.60.351310
28183640023.30.672321
318966239428.10.167210
40137403516843.12.288331
..............................
76310101764818032.90.171630
76421227027036.80.340270
7655121722311226.20.245300
7661126600030.10.349471
7671937031030.40.315230
\n", "

768 rows × 9 columns

\n", "
" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 1 }, { "cell_type": "markdown", "source": [ "1. Display the statistical summary of the dataset. What can you say about the dataset?" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "2. Use the query function to find the average BMI of the patients with diabetes and the average BMI of the patients without diabetes. What can you say about the results?\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "For the next exercises, we are going to use the datasets of Datahub, which is a platform for sharing and discovering open data. First, we are going to use the COVID dataset, which contains information about the COVID-19 pandemic. You can find it in the following URL:\n", "\n", "https://datahub.io/core/covid-1" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "data": { "text/plain": " Date China US United_Kingdom Italy France Germany \\\n0 2020-01-22 548 1 0 0 0 0 \n1 2020-01-23 643 1 0 0 0 0 \n2 2020-01-24 920 2 0 0 2 0 \n3 2020-01-25 1406 2 0 0 3 0 \n4 2020-01-26 2075 5 0 0 3 0 \n.. ... ... ... ... ... ... ... \n728 2022-01-19 118370 68684431 15610069 9219391 15288014 8361262 \n729 2022-01-20 118470 69329860 15718193 9418256 15715329 8502132 \n730 2022-01-21 118544 70209840 15814617 9603856 16116748 8635461 \n731 2022-01-22 118616 70495874 15891905 9781191 16506090 8716804 \n732 2022-01-23 118773 70699416 15966838 9923678 16807733 8773030 \n\n Spain Iran \n0 0 0 \n1 0 0 \n2 0 0 \n3 0 0 \n4 0 0 \n.. ... ... \n728 8676916 6231909 \n729 8834363 6236567 \n730 8975458 6241843 \n731 8975458 6245346 \n732 8975458 6250490 \n\n[733 rows x 9 columns]", "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
DateChinaUSUnited_KingdomItalyFranceGermanySpainIran
02020-01-225481000000
12020-01-236431000000
22020-01-249202002000
32020-01-2514062003000
42020-01-2620755003000
..............................
7282022-01-191183706868443115610069921939115288014836126286769166231909
7292022-01-201184706932986015718193941825615715329850213288343636236567
7302022-01-211185447020984015814617960385616116748863546189754586241843
7312022-01-221186167049587415891905978119116506090871680489754586245346
7322022-01-231187737069941615966838992367816807733877303089754586250490
\n

733 rows × 9 columns

\n
" }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Uncomment to use the version in datahub\n", "# covid_pd = pd.read_csv('https://datahub.io/core/covid-19/r/countries-aggregated.csv')\n", "covid_pd = pd.read_csv('https://raw.githubusercontent.com/ffraile/computer_science_tutorials/main/source/Data%20Manipulation/exercises/datasets/covid.csv')\n", "\n", "covid_pd" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "3. Use the data functions to create another column with the month of the date.\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "4. Now, group the data by month and country and create a dataset containing the total number of confirmed cases and deaths for each month and country. What can you say about the results?" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "5. For this exercisee, we are going to use another public repository of data, called [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). In this repository, you can find a lot of datasets for different purposes. We are going to use the dataset of the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset.\n", "Download the dataset as a CSV file, then import it into a Pandas dataframe. What are the columns of the dataset? Display the statistical summary." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n", "is_executing": true } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }